Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus
Abstract
In this article we illustrate and evaluate an approach to creating high-quality linguistically annotated resources by exploiting aligned parallel corpora. The approach rests on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The transfer approach has been tested and extensively applied in the creation of the MultiSemCor corpus, an English/Italian parallel corpus built on the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and annotated with word senses drawn from a shared sense inventory. A number of experiments have been carried out to evaluate the different steps of the methodology, and the results suggest that the transfer approach is a promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which is a crucial resource in itself. Second, it allows existing (mostly English) annotated resources to be exploited to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
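The transfer idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `transfer_annotations`, the toy sentence pair, the alignment pairs, and the sense labels are all hypothetical placeholders, and the sketch assumes a word alignment is already available as a list of (source index, target index) pairs.

```python
from typing import List, Optional, Tuple


def transfer_annotations(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy sense labels from source tokens to the target tokens aligned with them.

    source_senses[i] is the sense label of source token i, or None if unannotated.
    alignment holds (source_index, target_index) pairs.
    Target tokens with no aligned annotated source token stay None.
    """
    target_senses: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        sense = source_senses[src_idx]
        # Keep the first label that reaches a target token (an assumption of this sketch).
        if sense is not None and target_senses[tgt_idx] is None:
            target_senses[tgt_idx] = sense
    return target_senses


# Toy example; the sense labels are made-up placeholders, not real inventory entries.
english = ["the", "dog", "sleeps"]
italian = ["il", "cane", "dorme"]
english_senses = [None, "dog-sense-1", "sleep-sense-1"]
alignment = [(0, 0), (1, 1), (2, 2)]

italian_senses = transfer_annotations(english_senses, alignment, len(italian))
print(list(zip(italian, italian_senses)))
```

Because both languages share one sense inventory in this setting, the label can be copied unchanged; target words left unaligned simply remain unannotated in this sketch.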
Related resources
Evaluating Cross-Language Annotation Transfer in the MultiSemCor Corpus
In this paper we illustrate and evaluate an approach to the creation of high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. Th...
Browsing Multilingual Information with the MultiSemCor Web Interface
Parallel and comparable corpora represent a crucial resource for different Natural Language Processing tasks like machine translation, lexical acquisition, and knowledge structuring but are also suitable to be consulted by humans for different purposes, such as linguistic teaching, corpus linguistics, translation studies, lexicography, multilingual information browsing. To enhance their exploit...
Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure
Multilingual technologies, which to a large extent are language independent, provide a powerful support for easier building of annotated linguistic resources for languages where such resources are scarce or missing. All these technologies require parallel corpora in order to achieve their ends. Parallel texts encode extremely valuable linguistic knowledge because the linguistic decisions made b...
Opportunistic Semantic Tagging
Building semantically annotated corpora from scratch is a time consuming activity requiring very specialized resources. In this paper we present a pilot study carried out to test a methodology that can be used to create a semantically annotated corpus by exploiting information contained in an already annotated corpus. The main hypothesis underlying the proposed methodology is that, given a text...
Crossing Parallel Corpora and Multilingual Lexical Databases for WSD
Word Sense Disambiguation (WSD) is the task of selecting the correct sense of a word in a context from a sense repository. Typically, WSD is approached as a supervised classification task to get state-of-the-art performance (e.g. [6]), and thus a large amount of sense-tagged examples for each sense of the word is needed, according to the word-expert approach. This requirement makes the supervis...
Journal title:
- Natural Language Engineering
Volume 11
Publication date: 2005